This notebook shows a complete example of a simple step-by-step analytical process for a sample dataset ("adult census").
Its main goal is to present how such a report should be structured. The details of the analysis will, of course, depend on the specific problem, but the general structure should resemble the one presented below.
import pandas as pd
import numpy as np
import scipy.stats as sts
from pandas_profiling import ProfileReport
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold, train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, f1_score
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
data = pd.read_csv("adult.csv")
data.head(3)
data.dtypes
An exploratory data analysis (EDA) part should happen here. Its contents will depend on the data types and the research problem. An automated exploration library (pandas-profiling) was used to save time and space in this example.
Typically, EDA should be extensively commented - both from a technical perspective and a business-related one.
profile = ProfileReport(data, title='Dataset Report', explorative=True)
profile.to_notebook_iframe()
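If an automated profiler is not available, a few manual checks cover the same ground: missing values, numeric summaries, and category frequencies. A minimal sketch on a small synthetic frame (the column names below are illustrative, not taken from the actual dataset):

```python
import pandas as pd
import numpy as np

# Tiny synthetic frame standing in for the real data.
df = pd.DataFrame({
    "age": [25, 38, np.nan, 52],
    "workclass": ["Private", "State-gov", "Private", None],
    "hours_per_week": [40, 50, 40, 60],
})

missing = df.isna().sum()                                  # missing values per column
numeric_summary = df.describe()                            # stats for numeric columns only
categorical_counts = df["workclass"].value_counts(dropna=False)

print(missing)
print(numeric_summary)
print(categorical_counts)
```

Each of these outputs deserves a written comment in a real report - e.g. whether the missingness is random and how skewed the numeric columns are.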
Data preparation steps happen here - all necessary operations to make the data usable in a further modeling process.
X, y = data.drop("earnings", axis=1), data.earnings
y
y.unique()
y = LabelEncoder().fit_transform(y)
y
X_enc = pd.get_dummies(X)
X_enc
Typically, test data is held out for final evaluation. All algorithm tuning and parameter search happens on the training data (which can be split again into train and validation sets).
X_train_val, X_test, y_train_val, y_test = train_test_split(X_enc, y, test_size=0.2, random_state=123)
X_train_val.head(3)
X_test.head(3)
Usually, multiple algorithms are tested. A good practice is to pick one from each promising family (trivial algorithms, tree-based models, ensembles, etc.).
algorithms = {
'knn': KNeighborsClassifier(),
'dt': DecisionTreeClassifier(),
'rf': RandomForestClassifier()
}
kfold = KFold(n_splits=15, shuffle=True, random_state=456)  # random_state only takes effect with shuffle=True
It would be best if you did this for every single classifier in your set. It can take some time to complete the job, so the example here is presented only for decision trees.
tree_params_grid = {
'max_depth': [3, 5, 10, 20],
'criterion': ['gini', 'entropy'],
'max_features': [None, 5, 10, 20]
}
grid_tree = GridSearchCV(DecisionTreeClassifier(), tree_params_grid, cv=kfold, n_jobs=3)
grid_tree_results = grid_tree.fit(X_train_val, y_train_val)
# Check which estimator is the best
grid_tree_results.best_estimator_
Save the best estimator as your algorithm of choice to compare with the others
algorithms['dt'] = grid_tree_results.best_estimator_
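The per-classifier tuning above could be repeated for every algorithm by pairing each one with its own parameter grid. A sketch of that loop (the grids and a small synthetic dataset below are illustrative, not the original's settings):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.datasets import make_classification

X_demo, y_demo = make_classification(n_samples=200, random_state=0)

# Each entry: (estimator, its parameter grid).
param_grids = {
    "knn": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 11]}),
    "dt": (DecisionTreeClassifier(), {"max_depth": [3, 5, 10]}),
    "rf": (RandomForestClassifier(n_estimators=50), {"max_depth": [3, 5, 10]}),
}

kfold = KFold(n_splits=5, shuffle=True, random_state=456)
tuned = {}
for name, (estimator, grid) in param_grids.items():
    search = GridSearchCV(estimator, grid, cv=kfold, n_jobs=-1)
    tuned[name] = search.fit(X_demo, y_demo).best_estimator_
```

The resulting `tuned` dict could then replace `algorithms` in the comparison below, so every model enters the cross-validation already tuned.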
If you have sufficient data volume - perform cross-validation. Judging algorithms' performance from a single trial is misleading; results should always be verified multiple times to avoid false discoveries.
results = {}
for algo_name, algo in algorithms.items():
algo_results = cross_val_score(algo, X_train_val, y_train_val, cv=kfold, n_jobs=4)
results['model_' + algo_name] = algo_results
Once results from multiple algorithms have been collected, an analyst should compare them using statistical tools. Typically, statistical tests for the difference of means are performed. Judging by eye whether one algorithm outperforms another is not a good idea for scientific research.
results_df = pd.DataFrame.from_dict(results)
results_df.mean(axis=0)
results_df.std(axis=0)
results_df
A Friedman chi-square test checks whether, IN GENERAL, there are differences between the classifiers.
sts.friedmanchisquare(results_df.model_knn, results_df.model_rf, results_df.model_dt)
We can see that the differences are significant. Now we should perform POST HOC tests to see which classifiers differ.
More details about post-hoc tests can be found here
The first option is to use the statsmodels library, which can run post-hoc tests for multiple comparisons. Documentation can be found here.
WARNING - this library is unstable and changes often. This tutorial may become outdated within days.
The multiple comparisons procedure requires a dataframe in the following (long) format:
| Method | Score |
|---|---|
| Classifier1 | score1 |
| Classifier1 | score2 |
from statsmodels.sandbox.stats.multicomp import MultiComparison
results_df2 = results_df.copy()
results_df2['id'] = results_df2.index
results_reformatted = pd.melt(results_df2, id_vars='id').drop('id', axis=1)
results_reformatted
multicomp = MultiComparison(
data=results_reformatted.value,
groups=results_reformatted.variable,
)
print(multicomp.tukeyhsd())
We can see that all differences are significant EXCEPT random forest vs. decision tree - there is no significant difference between them (remember: we tuned the decision tree as much as we could).
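A simpler alternative to the statsmodels route (a sketch, not in the original notebook) is to run pairwise Wilcoxon signed-rank tests on the paired per-fold scores and apply a Bonferroni correction for the number of comparisons:

```python
from itertools import combinations
import numpy as np
import scipy.stats as sts

rng = np.random.default_rng(0)
# Synthetic per-fold accuracies for three models (illustrative only).
scores = {
    "model_knn": rng.normal(0.80, 0.01, 15),
    "model_dt": rng.normal(0.85, 0.01, 15),
    "model_rf": rng.normal(0.86, 0.01, 15),
}

pairs = list(combinations(scores, 2))
adjusted = {}
for a, b in pairs:
    stat, p = sts.wilcoxon(scores[a], scores[b])      # paired, non-parametric
    adjusted[(a, b)] = min(p * len(pairs), 1.0)        # Bonferroni correction
    print(f"{a} vs {b}: adjusted p = {adjusted[(a, b)]:.4f}")
```

The Wilcoxon test is appropriate here because the folds pair the scores across models; the Bonferroni factor keeps the family-wise error rate at the nominal level, at the cost of being conservative.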
You can use the scipy-posthocs library.
This library uses the same data format as statsmodels above.
We will use the Tukey post-hoc test after the Friedman test (multiple models).
WARNING - this library is also experimental and can change rapidly.
import scikit_posthocs as sp
pvals = sp.posthoc_tukey(results_reformatted, val_col='value', group_col='variable')
pvals < 0.05
pvals
Once the analysis is done, and the performance of different methods has been compared - one should prepare a final chapter presenting the following aspects: